Functional Annotation of Proteins through Substructure Matching

Authors

  • Mark Moll
  • Drew H. Bryant
  • Lydia E. Kavraki
Abstract

The number of known protein structures is rapidly increasing, yet the function of most proteins is still poorly understood or completely unknown. At the same time, there are many large protein families for which many structural variants are available. We present a substructure-based approach called LabelHash that can be used to annotate proteins with unknown function. We also describe a new method that uses LabelHash as a tool to help understand the structural variations within classes of proteins with known function. This structural variation within a family of related proteins can be exploited to design drugs with very high specificity.

Determining the function of unannotated proteins would have a significant impact on understanding diseases and designing new therapeutics. However, experimental determination of protein function is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Our focus is on methods that determine binding site similarity. Although several such methods exist, it remains challenging to quickly find all functionally related matches for structural motifs in the entire Protein Data Bank (PDB) with high specificity. In this context, a structural motif is a set of 3D points, annotated with physicochemical information, that characterizes a molecular function.

We have developed a method called LabelHash that creates a hash table of n-tuples of residues for all structures in the PDB [2]. The method is inspired by geometric hashing, a technique that originated in computer vision but has also been applied to matching structural motifs [6, 5]. The key advantage of LabelHash over geometric hashing is that it uses much less space and scales more easily to very large data sets such as the entire PDB.
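
To make the idea of a hash table of residue n-tuples concrete, here is a minimal sketch. It is not the LabelHash implementation: the key design (sorted residue-type labels), the pairwise distance filter, the names `build_table` and `lookup`, and the toy coordinates are all assumptions made for illustration. The brute-force enumeration is only practical for tiny inputs.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def build_table(structures, n=3, max_dist=10.0):
    """Index every compact n-tuple of residues under its sorted residue labels.

    `structures` maps a structure id to a list of (label, (x, y, z)) residues.
    """
    table = defaultdict(list)
    for sid, residues in structures.items():
        for combo in combinations(range(len(residues)), n):
            # Assumed geometric filter: all residues pairwise within max_dist.
            compact = all(
                dist(residues[i][1], residues[j][1]) <= max_dist
                for i, j in combinations(combo, 2)
            )
            if not compact:
                continue
            # Order residues by label so the key ignores input order.
            ordered = tuple(sorted(combo, key=lambda i: residues[i][0]))
            key = tuple(residues[i][0] for i in ordered)
            table[key].append((sid, ordered))
    return table


def lookup(table, labels):
    """Return all (structure id, residue indices) entries for a label tuple."""
    return table.get(tuple(sorted(labels)), [])


# Hypothetical toy data, not real PDB entries.
toy_structures = {
    "toyA": [("HIS", (0, 0, 0)), ("ASP", (3, 0, 0)),
             ("SER", (0, 4, 0)), ("GLY", (30, 30, 30))],
    "toyB": [("SER", (1, 1, 1)), ("HIS", (2, 2, 2)), ("ASP", (50, 0, 0))],
}
toy_table = build_table(toy_structures)
hits = lookup(toy_table, ["SER", "HIS", "ASP"])
```

In this toy setup only `toyA` contains a compact HIS/ASP/SER triple, so the lookup returns a single candidate that a matcher could then verify geometrically.
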
Using the LabelHash hash tables, we can quickly look up partial matches to a motif and expand those matches to complete matches. We show that by applying only very mild geometric constraints we can find statistically significant matches with extremely high sensitivity and specificity for very general structural motifs (see Figure 1). The LabelHash method is also extremely fast: it can match motifs ranging in size from 3 to 11 residues against all structures in the 95% sequence identity filtered non-redundant PDB in a matter of seconds to minutes. A web server front-end for LabelHash as well as a command-line version are available at http://labelhash.kavrakilab.org [3].

The LabelHash method is sufficiently fast that it can be used to perform a detailed analysis of the structural variability within large protein families or even superfamilies. Structural variations caused by a wide range of physicochemical and biological sources directly influence the function of a protein. Comparative analysis of drug-receptor substructures across and within species has been used for lead evaluation. Substructure-level similarity between the binding sites of functionally similar proteins has also been used to identify instances of convergent evolution among proteins.

The Family-wise Analysis of SubStructural Templates (FASST) method uses LabelHash for all-against-all substructure comparison to determine substructural clusters [1]. Substructural clusters characterize the binding site substructural variation within a protein family. We focus on examples of automatically determined substructural clusters that can be linked to phylogenetic distance between family members (see Figure 2), segregation by conformation, and organization by homology among convergent protein lineages.

The Motif Ensemble Statistical Hypothesis (MESH) framework constructs a representative motif for each protein cluster among the substructural clusters determined by FASST to build motif ...

Fig. 1. A substructure match (in green) shown superimposed on a motif (in white); the rest of the matching protein is shown in ribbon representation. Figure reproduced from [2].
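
The all-against-all comparison that yields substructural clusters can be pictured with a small sketch. The abstract does not say how FASST scores or groups substructures, so everything here is an assumption for illustration: binding sites are given as already superimposed point lists in matching order, similarity is plain RMSD, and grouping is single-linkage with a fixed cutoff. The names `rmsd` and `cluster` and the toy sites are hypothetical.

```python
from itertools import combinations
from math import sqrt


def rmsd(a, b):
    """RMSD between two equal-length point lists, assumed pre-superimposed."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return sqrt(sq / len(a))


def cluster(sites, cutoff):
    """Single-linkage clustering: link any two sites with RMSD <= cutoff."""
    parent = {name: name for name in sites}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All-against-all comparison over every unordered pair of sites.
    for a, b in combinations(sites, 2):
        if rmsd(sites[a], sites[b]) <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in sites:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy binding sites, two points each.
toy_sites = {
    "p1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "p2": [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)],
    "p3": [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
}
toy_clusters = cluster(toy_sites, cutoff=0.5)
```

With these toy inputs, `p1` and `p2` differ by a 0.1 shift and fall into one cluster, while `p3` stays on its own.
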

Similar resources

Functional Annotation of Two Hypothetical Proteins Reveals Valuable Proteins Involved in Response to Salinity: An in silico Approach

With the exponential growth in protein sequences and structures determined by genome sequencing and structural genomics approaches, there is a growing demand for valid bioinformatics methods to define the function of these proteins. In this study, our objective is to identify the function of unknown proteins from UCB-1 pistachio rootstock and specify their class...

The LabelHash Server and Tools for substructure-based functional annotation

Summary: The LabelHash server and tools are designed for large-scale substructure comparison. The main use is to predict the function of unknown proteins. Given a set of (putative) functional residues, LabelHash finds all occurrences of matching substructures in the entire Protein Data Bank, along with a statistical significance estimate and known functional annotations for each match. The resul...

PocketAnnotate: towards site-based function annotation

A computational pipeline PocketAnnotate for functional annotation of proteins at the level of binding sites has been proposed in this study. The pipeline integrates three in-house algorithms for site-based function annotation: PocketDepth, for prediction of binding sites in protein structures; PocketMatch, for rapid comparison of binding sites and PocketAlign, to obtain detailed alignment betwe...

Comparison of Hierarchical and Non-Hierarchical Clustering Results for Proteins Associated with Esophageal, Gastric, and Colon Cancers Based on Gene Ontology Annotation Similarities

Background and Objective: The use of proteomic methodologies and the advent of high-throughput (HTP) investigation of proteins have created a need for new approaches in bioinformatics analysis of experimental results. Cluster analysis is a suitable statistical procedure that can be useful for analyzing these data sets. Materials and Methods: In this research study, the identified proteins associated wi...

De-Orphaning the Structural Proteome through Reciprocal Comparison of Evolutionarily Important Structural Features

Function prediction frequently relies on comparing genes or gene products to search for relevant similarities. Because the number of protein structures with unknown function is mushrooming, however, we asked here whether such comparisons could be improved by focusing narrowly on the key functional features of protein structures, as defined by the Evolutionary Trace (ET). Therefore a series of a...

Protein Structural Motifs: Identification, Annotation and Use in Function Prediction

While functional motifs are commonly detected and studied in protein sequences, few three-dimensional (3D) motifs, that is, sets of residues spatially close in three dimensions but not necessarily adjacent in the sequence, have been identified so far, mostly through manual approaches. However, structural motifs may reveal novel and important functional sites and allow the detection of evolutiona...

Publication date: 2014